-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Flaky Test SpecificClusterManagerNodesIT.testElectOnlyBetweenClusterManagerNodes #16021
base: main
Are you sure you want to change the base?
Conversation
❕ Gradle check result for 1b3920b: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Checking the history of this test, has this been flaky for more than a year? The only code related to cluster manager election that I've been able to find changed more recently than ~5 years ago (besides renaming) is the introduction of DecommisionService in 2023. |
1338c91
to
3823b1c
Compare
@msfroh I'm not clearly why it wasn’t reported before, maybe the frequency of occurrence is very low and we ignored it? This issue #15944 collects some of the recent case. Logically speaking, when the cluster has no cluster manager, It's right to return null from |
…terManagerNodes Signed-off-by: kkewwei <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #16021 +/- ##
============================================
+ Coverage 71.92% 71.98% +0.06%
Complexity 64400 64400
============================================
Files 5281 5281
Lines 300995 300995
Branches 43479 43479
============================================
+ Hits 216491 216672 +181
+ Misses 66793 66569 -224
- Partials 17711 17754 +43 ☔ View full report in Codecov by Sentry. |
@msfroh @gaobinlong please help confirm it when you are free. |
This PR is stalled because it has been open for 30 days with no activity. |
return null; | ||
} | ||
}; | ||
assertThat(clusterManagerName.get(), equalTo(nextClusterManagerEligableNodeName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we swallow the NPE with the try/catch above, won't this test still fail on the assertion here since clusterManagerName.get()
would be null
and nextClusterManagerEligableNodeName
shouldn't be null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jed326 assertBusy
allows the assert failed in 10s, so it seems ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add an additional assert before this to evaluate it is not null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kkewwei for raising a fix
.getName(), | ||
equalTo(nextClusterManagerEligableNodeName) | ||
); | ||
Supplier<String> clusterManagerName = () -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can we move the supplier ouside the
assertBusy
block ? - nit: could be rename as
getClusterManagerIfElected
.nodes() | ||
.getClusterManagerNode() | ||
.getName(); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of catching a generic exception, could we catch a specific exception related to ClusterManager Not found.
return null; | ||
} | ||
}; | ||
assertThat(clusterManagerName.get(), equalTo(nextClusterManagerEligableNodeName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add an additional assert before this to evaluate it is not null.
Description
The case is as follows:
When the
node_t1
is excluded from the vote config, and the cluster starts a new leader election, but the the nodenode_t2
hasn't been elected as the new leader.At the moment, we send request to get the ClusterManager, we first get ClusterManager name, and leads to the NullPointerException.
internalCluster().nonClusterManagerClient()
-> ......->getClusterManagerName()
OpenSearch/test/framework/src/main/java/org/opensearch/test/InternalTestCluster.java
Line 2171 in f0ea056
Related Issues
Resolves #15944 #16015
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.